Skip to content

Improve test coverage and report#119

Open
skyw wants to merge 15 commits intomainfrom
skyw/ai_aided_test_improvement
Open

Improve test coverage and report#119
skyw wants to merge 15 commits intomainfrom
skyw/ai_aided_test_improvement

Conversation

@skyw
Copy link
Contributor

@skyw skyw commented Mar 6, 2026

Half done by Claude Code and I reviewed AI written code.

some tiny bugs a fixed along with it.

skyw added 8 commits March 5, 2026 14:23
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
@skyw skyw requested a review from a team as a code owner March 6, 2026 00:40
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR improves test coverage for SOAP utilities, scalar optimizers, and precondition schedules, while fixing two bugs: (1) empty Kronecker factor eigenbasis shape corrected from (0,) to (0, 0) in both get_eigenbasis_eigh and get_eigenbasis_qr, and (2) met_approx_eigvals_criteria condition rewritten to an equivalent but arguably clearer form. CI reporting is also improved with per-test XML output files.

Key changes:

  • soap_utils.py: torch.empty(0)torch.empty(0, 0) for empty Kronecker factors in two places — correct fix.
  • eig.py: met_approx_eigvals_criteria formula is algebraically equivalent to the old form; also fixes the return docstring and a parameter name in orthogonal_iteration.
  • test_soap_utils.py: New tests for zero-dim eigenbasis, empty QR factor, and all_eigenbases_met_criteria (positive and negative cases). The positive case now correctly uses .eigenvectors. A minor device inconsistency exists: torch.empty(0, 0) in the dims=[64, 0, 32] parameterized case of test_get_eigenbasis_eigh is not given device=self.device, which produces a CPU tensor in an otherwise-CUDA list on GPU runs.
  • test_scalar_optimizers.py: AdEMAMix parameterization fixes (shadowed variable bugs removed), plus two new Lion update tests. The num_beta_slow_warmup_steps=2 variant in the AdEMAMix test doesn't exercise the slow-beta schedule path since alpha=0 completely nullifies the slow EMA contribution.
  • test_soap.py: Schedule tests reorganised into a new ScheduleTest class with CosineSchedule and StepSchedule cases added.
  • L0_Tests_CPU.sh: Per-test XML reports and a loop deduplicating the torchrun calls; test_soap.py and test_soap_utils.py are still not wired into CI.

Confidence Score: 4/5

  • Safe to merge; changes are test improvements and minor bug fixes with no risk to production optimizer logic.
  • The two production-code fixes (empty tensor shape and formula rewrite in met_approx_eigvals_criteria) are mathematically sound and well-tested by the new cases. The only actionable finding is a device inconsistency on torch.empty(0, 0) in the dims=[64, 0, 32] test case, which is harmless on CPU CI but would surface on CUDA runs.
  • tests/test_soap_utils.py (device inconsistency in empty-tensor parameterized case)

Comments Outside Diff (1)

  1. tests/test_soap_utils.py, line 137 (link)

    Missing device=self.device on empty tensor in parameterized test.

    When the new {"dims": [64, 0, 32]} parameterized case is run on a CUDA device, torch.empty(0, 0) defaults to CPU while the dim=64 and dim=32 tensors are created on self.device (CUDA). The returned Q_list[1] will also be on CPU (since get_eigenbasis_eigh returns torch.empty(0, 0, device=kronecker_factor.device)), leaving a device-inconsistent output list. Any downstream consumer that feeds the full Q_list into a CUDA kernel would hit a device mismatch error.

Last reviewed commit: 1215f18

Comment on lines +199 to +203
def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
kronecker_factor_list = [torch.randn(N, N, device=self.device)]

eigenbasis_list = [torch.diag(torch.linalg.eigh(K).eigenvalues) for K in kronecker_factor_list]
self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong eigh attribute used — .eigenvalues instead of .eigenvectors

torch.linalg.eigh returns a named tuple with .eigenvalues (1-D vector λ) and .eigenvectors (N×N orthogonal matrix Q). The test wraps the 1-D eigenvalues in torch.diag(), producing a diagonal eigenvalue matrix D, and passes that to all_eigenbases_met_criteria.

However, the conjugate function (used internally) assumes its second argument is an orthogonal matrix. Passing a diagonal eigenvalue matrix instead breaks this invariant. The met_approx_eigvals_criteria check will compute a meaningless result and likely pass by chance, so the test does not validate the intended mathematical property.

Additionally, K = torch.randn(N, N) on line 200 is not symmetric; calling torch.linalg.eigh on it is undefined behaviour (PyTorch silently uses only the lower triangular part).

The test should construct a symmetric matrix and use the eigenvectors:

def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
    g = torch.randn(N, N, device=self.device)
    K_sym = g @ g.T + torch.eye(N, device=self.device) * 1e-5  # symmetric PSD
    kronecker_factor_list = [K_sym]

    eigenbasis_list = [torch.linalg.eigh(K_sym).eigenvectors]
    self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkhona-nvidia , should we use eigen values or eigen vectors?

skyw and others added 2 commits March 5, 2026 16:52
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
@skyw
Copy link
Contributor Author

skyw commented Mar 6, 2026

/ok to test b4ea359

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Test Results

   46 files  + 12     96 suites  +38   1m 11s ⏱️ +2s
  955 tests + 38    955 ✅ + 38  0 💤 ±0  0 ❌ ±0 
2 141 runs  +316  2 141 ✅ +316  0 💤 ±0  0 ❌ ±0 

Results for commit 1215f18. ± Comparison against base commit 7056267.

This pull request removes 2 and adds 40 tests. Note that renamed tests count towards both.
__main__.ScalarOptimizerTest ‑ test_calculate_ademamix_update_with_alpha_zero_equals_adam
__main__.SoapFunctionsTest ‑ test_soap_optimizer_class_based_schedule
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_close_to_non_distributed0 (shape=(3, 32))
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_close_to_non_distributed1 (shape=(5, 100))
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_with_partial_tp_close_to_non_distributed0 (shape=(32, 3), transpose=True, tp_size=2)
__main__.DistributedNewtonSchulzCpuTest ‑ test_1step_with_partial_tp_close_to_non_distributed1 (shape=(5, 100), transpose=False, tp_size=4)
__main__.DistributedNewtonSchulzCpuTest ‑ test_5steps_with_transpose_close_to_non_distributed0 (shape=(32, 3), transpose=True)
__main__.DistributedNewtonSchulzCpuTest ‑ test_5steps_with_transpose_close_to_non_distributed1 (shape=(5, 100), transpose=False)
__main__.DistributedNewtonSchulzCpuTest ‑ test_distributed_normalize_close_to_non_distributed0 (shape=(21, 16))
__main__.DistributedNewtonSchulzCpuTest ‑ test_distributed_normalize_close_to_non_distributed1 (shape=(16, 32))
__main__.DistributedNewtonSchulzStepCpuTest ‑ test_close_to_non_distributed0 (shape=(21, 16))
__main__.DistributedNewtonSchulzStepCpuTest ‑ test_close_to_non_distributed1 (shape=(16, 32))
…

♻️ This comment has been updated with latest results.

@skyw skyw enabled auto-merge (squash) March 6, 2026 02:49
@skyw skyw requested a review from mkhona-nvidia March 6, 2026 03:29
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
@skyw
Copy link
Contributor Author

skyw commented Mar 6, 2026

/ok to test 23e4b0c

Signed-off-by: Hao Wu <skyw@nvidia.com>
@skyw
Copy link
Contributor Author

skyw commented Mar 6, 2026

/ok to test 0825b7e

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
@skyw
Copy link
Contributor Author

skyw commented Mar 6, 2026

/ok to test 1215f18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant